Abstract

Algorithmic statistics considers the following problem: given a binary string $x$ (e.g., some experimental data), find a "good" explanation of this data. It uses algorithmic information theory to define formally what a good explanation is. In this paper we extend this framework in two directions.

First, the explanations are not only interesting in themselves but are also used for prediction: we want to know what kind of data we may reasonably expect in similar situations (repeating the same experiment). We show that some kind of hierarchy can be constructed both in terms of algorithmic statistics and using the notion of a priori probability, and these two approaches turn out to be equivalent (Theorem 3).

Second, a more realistic approach that goes back to machine learning theory assumes that we have not a single data string $x$ but some set of "positive examples" $x_1, \ldots, x_l$ that all belong to some unknown set $A$, a property that we want to learn. We want this set $A$ to contain all positive examples and to be as small and simple as possible. We show how algorithmic statistics can be extended to cover this situation (Theorem 8).

Keywords: algorithmic information theory, minimal description length, 
prediction, Kolmogorov complexity, learning. 
1 Introduction and notation


Let $x$ be a binary string, and let $A$ be a finite set of binary strings containing $x$. Considering $A$ as an "explanation" (statistical model) for $x$, we want $A$ to be as simple and small as possible (the smaller $A$ is, the more specific the explanation is). This approach can be made formal in the framework of algorithmic information theory, where the notion of algorithmic (Kolmogorov) complexity of a finite object (a string or a set encoded as a binary string in a natural way) is defined.

The definition and basic properties of Kolmogorov complexity can be found in the standard textbooks; for a short survey see [1]. Informally, the Kolmogorov complexity of a string $x$ is defined as the minimal length of a program that produces $x$. This definition depends on the programming language, but there are optimal languages that make the complexity minimal up to a constant; we fix one of them and denote the complexity of $x$ by $C(x)$.

We also use another basic notion of algorithmic information theory, the discrete a priori probability. Consider a probabilistic machine $A$ without input that outputs some binary string and stops. It defines a probability distribution on binary strings: $m_A(x)$ is the probability to get $x$ as the output of $A$. (The sum of $m_A(x)$ over all $x$ can be less than 1 since the machine can also hang.) The functions $m_A$ can also be characterized as lower semicomputable semimeasures (non-negative real-valued functions $m(\cdot)$ on binary strings such that the set of pairs $(r, x)$, where $r$ is a rational number, $x$ is a binary string and $r < m(x)$, is computably enumerable, and $\sum_x m(x) \le 1$). There exists a universal machine $U$ such that $m_U$ is maximal (up to an $O(1)$-factor) among all $m_A$. We fix some $U$ with this property and call $m_U(x)$ the discrete a priori probability of $x$, denoted as $m(x)$. The function $m$ is closely related to Kolmogorov complexity. Namely, the value $-\log_2 m(x)$ is equal to $C(x)$ with $O(\log C(x))$-precision.

Now we can define two parameters that measure the quality of a finite set $A$ as a model for its element $x$: the complexity $C(A)$ of $A$ and the binary logarithm $\log|A|$ of its size. The first parameter measures how simple our explanation is; the second one measures how specific it is. We use binary logarithms to get both parameters on the same scale: to specify an element of a set of size $N$ we need $\log N$ bits of information.

There is a trade-off between the two parameters. The singleton $A = \{x\}$ is a very specific description, but its complexity may be high. On the other hand, for an $n$-bit string $x$ the set $A = \mathbb{B}^n$ of all $n$-bit strings is simple, but it is large. To analyze this trade-off, let us note that every set $A$ containing $x$ leads to a two-part description of $x$: first we specify $A$ using $C(A)$ bits, and then we specify $x$ by its ordinal number in $A$, using $\log|A|$ bits. In total we need $C(A) + \log|A|$ bits to specify $x$ (plus a logarithmic number of bits to separate the two parts of the description). This gives the inequality

$$C(x) \le C(A) + \log|A| + O(\log C(A))$$

(the length of the optimal description, $C(x)$, does not exceed the length of any two-part description). The difference

$$\delta(x, A) = C(A) + \log|A| - C(x)$$

is called the optimality deficiency of $A$ (as a model for $x$). As usual in algorithmic statistics, all our statements are made with logarithmic precision (with error tolerance $O(\log n)$ for $n$-bit strings), so we ignore the logarithmic terms and say that $\delta(x, A)$ is non-negative and measures the overhead caused by using the two-part description based on $A$ instead of the optimal description for $x$.
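The bookkeeping behind the two-part description can be illustrated by a minimal numeric sketch. Kolmogorov complexity is not computable, so the complexity values below are hypothetical inputs chosen for illustration, the function names are ours, and logarithmic terms are ignored as in the text.

```python
import math

def two_part_length(c_model: float, model_size: int) -> float:
    """Length C(A) + log2|A| of the two-part description via a model A."""
    return c_model + math.log2(model_size)

def optimality_deficiency(c_x: float, c_model: float, model_size: int) -> float:
    """delta(x, A) = C(A) + log2|A| - C(x), ignoring logarithmic terms."""
    return two_part_length(c_model, model_size) - c_x

n = 16
c_x = 10.0  # hypothetical complexity of some 16-bit string x

# Singleton model {x}: complexity C(x), size 1.
singleton = optimality_deficiency(c_x, c_model=c_x, model_size=1)
# Model B^n: negligible complexity (taken as 0 here), size 2^n.
all_strings = optimality_deficiency(c_x, c_model=0.0, model_size=2 ** n)
```

Under these assumed values the singleton has zero deficiency, while the set of all 16-bit strings pays an overhead of $n - C(x) = 6$ bits.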

Note that this overhead $\delta(x, A)$ is zero for $A = \{x\}$, so the question is whether we can obtain a set $A$ that is simpler than $x$ but keeps $\delta(x, A)$ reasonably small. This trade-off is reflected by a curve that is sometimes called the profile of $x$; this profile can also be defined in terms of randomness deficiency (the notion of $(\alpha, \beta)$-stochasticity introduced by Kolmogorov) and in terms of time-bounded Kolmogorov complexity (the notion of depth).

In our paper we apply these notions to the analysis of prediction and learning. In Section 2 we consider, for a given string $x$, all "good" explanations and consider their union. Elements of this union are strings that can be reasonably expected when the experiment that produced $x$ is repeated. We show that this union has another equivalent definition in terms of a priori probability (Theorem 3).

In Subsection 2.5 we consider a situation where we start with several data strings $x_1, \ldots, x_l$ obtained in several independent experiments of the same type. We show that all the basic notions of algorithmic statistics can be extended (with appropriate changes) to this framework, as well as Theorem 3.
2 Prediction Hierarchy

2.1 Algorithmic prediction 

Assume that we have some experimental data represented as a binary string $x$. We look for a good statistical model for $x$ and find some set $A$ that has small optimality deficiency $\delta(x, A)$. If we believe in this model, we expect only elements from $A$ as outcomes when the same experiment is repeated. The problem, however, is that many different models with small optimality deficiency may exist for a given $x$, and they may contain different elements. If we want to cover all the possibilities, we need to consider the union of all these sets, so we get the following definition. In this definition we assume that $x$ is a binary string of length $n$, and all the sets $A$ also contain only strings of length $n$.

Definition 1. Let $x \in \mathbb{B}^n$ be a binary string and let $d$ be some integer. The union of all finite sets of strings $A \subset \mathbb{B}^n$ such that $x \in A$ and $\delta(x, A) \le d$ is called the algorithmic prediction $d$-neighborhood of $x$.

Obviously the $d$-neighborhood increases as $d$ increases. It becomes trivial (contains all $n$-bit strings) when $d = n$ (then $\mathbb{B}^n$ is one of the sets $A$ in the union).
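Definition 1 can be mimicked on a toy scale. The sketch below is only an illustration: the complexities attached to the models are made-up numbers standing in for the uncomputable $C(A)$ and $C(x)$, and the finite list of models replaces the union over all finite sets.

```python
import math

def deficiency(c_x, c_model, model):
    """delta(x, A) = C(A) + log2|A| - C(x) with assumed complexities."""
    return c_model + math.log2(len(model)) - c_x

def algorithmic_neighborhood(x, c_x, models, d):
    """Union of all listed models A containing x with delta(x, A) <= d.

    `models` is a list of (set_of_strings, assumed_complexity) pairs.
    """
    result = set()
    for model, c_model in models:
        if x in model and deficiency(c_x, c_model, model) <= d:
            result |= model
    return result

# Hypothetical toy data: 2-bit strings with made-up complexities.
x, c_x = "00", 1.0
models = [
    ({"00"}, 1.0),                     # singleton, delta = 0
    ({"00", "11"}, 1.0),               # delta = 1
    ({"00", "01", "10", "11"}, 1.0),   # all 2-bit strings, delta = 2
]
hood0 = algorithmic_neighborhood(x, c_x, models, d=0)
hood1 = algorithmic_neighborhood(x, c_x, models, d=1)
hood2 = algorithmic_neighborhood(x, c_x, models, d=2)
```

The three neighborhoods are nested, which matches the monotonicity in $d$ noted above.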

Example 1. If $x = 0\ldots0$ (the string consisting of $n$ zeros), then $x'$ belongs to the $d$-neighborhood of $x$ iff $C(x') \lesssim d$ (up to logarithmic terms).

Example 2. If $x$ is a random string of length $n$ (i.e., $C(x) \approx n$), then the $d$-neighborhood of $x$ contains all strings of length $n$ provided $d$ is greater than some function of order $O(\log n)$.

2.2 Probabilistic prediction 

There is another natural approach to prediction. Since we treat the experiment as a black box (the only thing we know is its outcome $x$), we assume that the possible models $A \subset \mathbb{B}^n$ are distributed according to their a priori probabilities, and consider the following two-stage process. First, a finite set is selected randomly: a non-empty set $A$ is chosen with probability $m(A)$ (note that a priori probability can be naturally defined for finite sets via some computable encoding). Second, a random element $x$ of $A$ is chosen uniformly. In this process every string $x$ is chosen with probability

$$\sum_{A \ni x} \frac{m(A)}{|A|},$$

and it is easy to see that this probability is equal to $m(x)$ up to an $O(1)$-factor. Indeed, the formula above defines a lower semicomputable function of $x$, so it does not exceed $m(x)$ by more than an $O(1)$-factor. On the other hand, if we restrict the sum to the singleton $\{x\}$, we already get $m(x)$ up to a constant factor. So this process gives nothing new in terms of the final output distribution on the outcomes $x$. Still, the advantage is that we may consider, for a given pair of strings $x$ and $y$, the conditional probability

$$p(y \mid x) = \Pr[\, y \in A \mid \text{the output of the two-stage process is } x \,].$$

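In other words, by definition

$$p(y \mid x) = \frac{\sum_{A \ni x, y} m(A)/|A|}{\sum_{A \ni x} m(A)/|A|}. \tag{1}$$

As we have said, the denominator equals $m(x)$ up to an $O(1)$-factor, so

$$p(y \mid x) = \frac{\sum_{A \ni x, y} m(A)/|A|}{m(x)} \tag{2}$$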

up to an $O(1)$-factor. Having some string $x$ and some threshold $d$, we can now consider all strings $y$ such that $p(y \mid x) \ge 2^{-d}$ (we use the logarithmic scale to facilitate the comparison with algorithmic prediction). These strings could be considered as plausible ones to appear when repeating the experiment of unknown nature that once gave $x$.
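The conditional probability of the two-stage process can be computed exactly on a toy example. In the sketch below the weights are hypothetical and merely stand in for the uncomputable a priori probability $m(A)$; the function name and the three models are ours.

```python
from fractions import Fraction

def conditional_probability(y, x, prior):
    """p(y | x) for the two-stage process: pick A with weight prior[A],
    then pick an element of A uniformly.

    `prior` maps frozenset models to (semi)probabilities.
    """
    joint = sum(w / len(a) for a, w in prior.items() if x in a and y in a)
    marginal = sum(w / len(a) for a, w in prior.items() if x in a)
    return joint / marginal

# Hypothetical prior over three models of 2-bit strings (weights sum to <= 1).
prior = {
    frozenset({"00"}): Fraction(1, 4),
    frozenset({"00", "11"}): Fraction(1, 4),
    frozenset({"00", "01", "10", "11"}): Fraction(1, 4),
}
p_11 = conditional_probability("11", "00", prior)
p_01 = conditional_probability("01", "00", prior)
p_00 = conditional_probability("00", "00", prior)
```

With the threshold $d = 2$, the string 11 passes the test $p(y \mid x) \ge 2^{-d}$ (its probability is 3/7), while 01 does not (its probability is 1/7).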

Our main result shows that this approach is essentially equivalent to algorithmic prediction. For a technical reason we have to change slightly the random process that defines $p(y \mid x)$. Namely, it is strange to consider models that are much more complex than $x$ itself, so we consider only sets $A$ whose complexity does not exceed $\mathrm{poly}(n)$; any sufficiently large polynomial can be used here (in fact, $4n$ is enough). So we assume that the sums in (1) and (2), and in similar formulas in the sequel, are always restricted to sets $A \subset \mathbb{B}^n$ that have complexity at most $4n$, and take this modified version of (1) as the final definition of $p(y \mid x)$.
Definition 2. Let $x$ be a binary string and let $d$ be an integer. The set of all strings $y$ such that $p(y \mid x) \ge 2^{-d}$ is called the probabilistic prediction $d$-neighborhood of $x$.

We are ready to state the main result of this section. 

Theorem 3. (a) For every $n$-bit string $x$ and for every $d$ the algorithmic prediction $d$-neighborhood of $x$ is contained in the probabilistic prediction $d + O(\log n)$-neighborhood.

(b) For every $n$-bit string $x$ and for every $d$ the probabilistic prediction $d$-neighborhood of $x$ is contained in the algorithmic prediction $d + O(\log n)$-neighborhood.

The next section contains the proof of this result; later we show some of its possible extensions.

2.3 The proof of Theorem 3

Proof of (a). This direction is simple. Assume that some string $y$ belongs to the algorithmic prediction $d$-neighborhood of $x$, i.e., there is a set $A$ containing $x$ and $y$ such that $C(A) + \log|A| \le C(x) + d$. We may assume without loss of generality that $d \le 2n$; otherwise all $n$-bit strings belong to the probabilistic prediction $d$-neighborhood of $x$ (take $A = \mathbb{B}^n$). Then the inequality for $C(A) + \log|A|$ implies that the complexity of $A$ does not exceed $4n$, so the set $A$ is included in the sum. This inequality also implies that

$$\frac{m(A)/|A|}{m(x)} \ge 2^{-d - O(\log n)}$$

(as we have said, $-\log m(u)$ equals $C(u) + O(\log C(u))$). This fraction is one of the terms in the sum that defines $p(y \mid x)$, so $y$ belongs to the probabilistic prediction $d + O(\log n)$-neighborhood of $x$. □

Before proving the second part (b), we need to prove a technical lemma. It is inspired by [6, Lemma 6], where it was shown that if a string belongs to many sets of bounded complexity, then one of them has even smaller complexity. We generalize that result as follows.

Lemma 4. Assume that sets $L$ and $R$ consist of finite objects (in particular, Kolmogorov complexity $C(v)$ is defined for $v \in L$). Assume that $R$ has at most $2^n$ elements. Let $G$ be a finite bipartite graph where $L$ and $R$ are the sets of its left and right nodes, respectively. Assume that a right node $x$ has at least $2^k$ neighbors of Kolmogorov complexity at most $i$. Then $x$ has a neighbor of complexity at most $i - k + O(C(G) + \log(k + i + n))$. Here $C(G)$ stands for the length of the shortest program that, given any $v \in L$, outputs a list of its neighbors.

Proof. Let us enumerate left nodes that have complexity at most $i$. We start a selection process: some of them are marked (= selected) immediately after they appear in the enumeration. This selection should satisfy the following requirements:

• at any moment every right node that has at least $2^k$ neighbors among the enumerated nodes has a marked neighbor;

• the total number of marked nodes does not exceed $2^{i-k} p(i, k, n)$ for some polynomial $p$ (fixed in advance).

If we have such a selection strategy of complexity $C(G) + O(\log(i + k + n))$, this implies that the right node $x$ has a neighbor of complexity at most

$$i - k + O(C(G) + \log(k + i + n)),$$

namely, any of its marked neighbors (that marked neighbor can be specified by its number in the list of all marked nodes).
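The last step can be spelled out as a short calculation. It assumes that the list of marked nodes is enumerated by a program of length $C(G) + O(\log(i + k + n))$ (the selection strategy together with $i$, $k$, $n$); a marked neighbor $v$ is then determined by that program and its index in the list, and the index is bounded by the number of marked nodes.

```latex
\begin{align*}
C(v) &\le \log_2\bigl(2^{i-k}\, p(i,k,n)\bigr) + C(G) + O(\log(i+k+n)) \\
     &= i - k + \log_2 p(i,k,n) + C(G) + O(\log(i+k+n)) \\
     &= i - k + O\bigl(C(G) + \log(i+k+n)\bigr),
\end{align*}
```

since $\log_2 p(i,k,n) = O(\log(i+k+n))$ for a polynomial $p$.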

To prove the existence of such a strategy, let us consider the following game. The game is played by two players, who alternate moves. The maximal number of moves is $2^i$. At each move the first player plays a left node, and the second player replies saying whether she marks that node or not. The second player loses if the number of marked nodes exceeds $2^{i-k+1}(n+1)\ln 2$ or if after some of her moves there exists a right node $y$ that has at least $2^k$ neighbors among the nodes chosen by the first player but has no marked neighbor. (The choice of the bound $2^{i-k+1}(n+1)\ln 2$ will be clear from the probabilistic estimate below.) Otherwise she wins.

Assume first that the set $L$ of left nodes is finite (recall that the set of right nodes is finite by assumption). Then our game is a finite game with full information, and hence one of the players has a winning strategy. We claim that the second player can win. If this is not the case, the first player has a winning strategy. We get a contradiction by showing that the second player has a probabilistic strategy that wins with positive probability against any strategy of the first player. So we assume that some strategy of the first player is fixed, and consider the following simple probabilistic strategy of the second player: every node presented by the first player is marked with probability $p = 2^{-k}(n+1)\ln 2$. The expected number of marked nodes is $p 2^i = 2^{i-k}(n+1)\ln 2$. By Markov's inequality, the number of marked nodes exceeds the expectation by a factor of 2 with probability less than $1/2$. So it is enough to show that the second bad case (after some move there exists a right node $y$ that has $2^k$ neighbors among the nodes chosen by the first player but has no marked neighbor) happens with probability at most $1/2$.

For that, it is enough to show that for every right node $y$ the probability of this bad event is less than $1/2$ divided by the number $|R|$ of right nodes. Let us estimate this probability. If $y$ has $2^k$ (or more) neighbors, the second player had (at least) $2^k$ chances to mark its neighbor (when these $2^k$ nodes were presented by the first player), and the probability to miss all these $2^k$ chances is at most $(1-p)^{2^k}$. The choice of $p$ guarantees that this probability is less than $2^{-n-1} \le (1/2)/|R|$. Indeed, using the bound $1 - x \le e^{-x}$, it is easy to show that

$$(1-p)^{2^k} \le e^{-p 2^k} = e^{-(n+1)\ln 2} = 2^{-n-1}.$$
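The estimate can be checked numerically for small parameters. The sketch below is only a sanity check in floating point, not part of the proof; the function names are ours, and the check is restricted to parameters for which the marking probability $p = 2^{-k}(n+1)\ln 2$ does not exceed 1.

```python
import math

def miss_probability(n: int, k: int) -> float:
    """Probability (1 - p)^(2^k) that 2^k independent chances all fail,
    where p = 2^(-k) * (n + 1) * ln 2 is the marking probability."""
    p = 2.0 ** (-k) * (n + 1) * math.log(2)
    assert 0.0 < p <= 1.0, "sketch only covers parameters with p <= 1"
    return (1.0 - p) ** (2 ** k)

def union_bound(n: int, k: int) -> float:
    """Union bound over at most 2^n right nodes."""
    return (2 ** n) * miss_probability(n, k)

# Parameter pairs (n, k) for which p is a valid probability.
checks = [(n, k) for n in range(1, 8) for k in range(4, 12)
          if 2.0 ** (-k) * (n + 1) * math.log(2) <= 1.0]
all_ok = all(miss_probability(n, k) <= 2.0 ** (-n - 1) for n, k in checks)
```

For every pair in `checks` the miss probability stays below $2^{-n-1}$, so the union bound over at most $2^n$ right nodes stays below $1/2$, as the argument requires.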

We have proven that a winning strategy exists but have not yet estimated its complexity. A winning strategy can be found by an exhaustive search among all the strategies. The set of all strategies is finite and the game is specified by $G$, $i$ and $k$. Therefore the complexity of the first winning strategy found is at most $C(G) + O(\log(i + k))$.

Thus Lemma 4 is proven in the case when $L$ is a finite set. To extend the proof to the general case, notice that the winning condition depends only on the neighborhood of each left node. The worst graph for the second player is the following "model" graph. It has $2^{2^n + i}$ left nodes and $2^n$ right nodes, and each of the $2^{2^n}$ possible neighborhoods is shared by $2^i$ left nodes. A winning strategy for such a graph can be found from $n$, $i$ and $k$, and hence its complexity is logarithmic in $n + i + k$. That strategy can be translated to the game associated with the initial graph; this translation increases the complexity by $C(G)$, as we have to translate each left node played by the first player to a left node of the model graph. □

Having in mind future applications in Subsection 2.4, we will consider in the next statement an arbitrary decidable family $\mathcal{A}$ of finite sets, though in this section we need only the case when $\mathcal{A}$ contains all finite sets.



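Corollary 5. Let $\mathcal{A}$ be a decidable family of finite sets. Assume that $x_1, \ldots, x_l$ are strings of length $n$. Denote by $\mathcal{A}^n_m$ the family of all subsets of $\mathbb{B}^n$ of complexity at most $m$ that belong to $\mathcal{A}$. Then the sum

$$S := \sum_{\substack{A \in \mathcal{A}^n_m \\ x_1, \ldots, x_l \in A}} \frac{m(A)}{|A|}$$

equals its maximal term up to a factor of $2^{O(\log(n+m+l))}$.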


Proof of the corollary. Let $M$ denote the maximal term in the sum $S$. Obviously the sum $S$ is equal to the sum over $i \le m$ and $j \le n$ of the sums

$$\sum_{\substack{A \in \mathcal{A}^n_m \\ C(A) = i \\ \lceil \log|A| \rceil = j \\ x_1, \ldots, x_l \in A}} \frac{m(A)}{|A|}. \tag{3}$$


As there are $(m+1)(n+1)$ such sums, we only need to prove that each sum (3) is at most $M \cdot 2^{O(\log(n+m+l))}$. In other words, we have to show that for all $i$, $j$ there is a set $A \in \mathcal{A}^n_m$ with $x_1, \ldots, x_l \in A$ such that $m(A)/|A|$ is greater than the sum (3) up to a factor of $2^{O(\log(n+m+l))}$.

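To this end fix $i$ and $j$. Since $m(A) = 2^{-C(A) + O(\log m)}$, the sum (3) equals

$$\sum_{\substack{C(A) = i \\ \lceil \log|A| \rceil = j \\ x_1, \ldots, x_l \in A}} 2^{-C(A) - \log|A| + O(\log(n+m))} \;=\; \sum_{\substack{C(A) = i \\ \lceil \log|A| \rceil = j \\ x_1, \ldots, x_l \in A}} 2^{-i-j+O(\log(n+m))}. \tag{4}$$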


All the terms in the sum (4) coincide and thus the sum (3) is equal to $2^{-i-j+O(\log(n+m))}$ times the number of sets $A \in \mathcal{A}^n_m$ with $C(A) = i$, $\lceil \log|A| \rceil = j$, $x_1, \ldots, x_l \in A$. Let $k$ denote the floor of the binary logarithm of that number.

Consider the bipartite graph whose left nodes are finite sets from $\mathcal{A}^n$ of cardinality at most $2^j$, whose right nodes are $l$-tuples of $n$-bit strings, and where a left node $A$ is adjacent to a right node $(x_1, \ldots, x_l)$ if all $x_1, \ldots, x_l$ are in $A$. The complexity of this graph is $O(\log(n + l + j))$ and the logarithm of the number of right nodes is $nl$. By Lemma 4 there is a set $A \in \mathcal{A}^n_m$ of log-size $j$ and complexity at most $i - k + O(\log(i + j + k + n + l)) = i - k + O(\log(l + m + n))$ with $x_1, \ldots, x_l \in A$. The fraction $m(A)/|A|$ is equal to $2^{k-i-j}$ up to a factor of $2^{O(\log(n+m+l))}$. Recall that the sum (3) equals $2^k 2^{-i-j}$ up to the same factor, and thus we are done. □
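The chain of estimates in this proof can be summarized in one display. Here $S_{i,j}$ (the inner sum for fixed $i$, $j$) and $N_{i,j}$ (the number of sets counted in it, so that $k = \lfloor \log N_{i,j} \rfloor$) are our shorthand, not notation from the proof, and the set $A$ supplied by Lemma 4 is assumed to be one of the terms of $S$.

```latex
\begin{align*}
S &= \sum_{i \le m}\sum_{j \le n} S_{i,j}
   \le (m+1)(n+1)\,\max_{i,j} S_{i,j}, \\
S_{i,j} &= N_{i,j}\, 2^{-i-j+O(\log(n+m))}
   < 2^{k+1}\, 2^{-i-j+O(\log(n+m))}, \\
\frac{m(A)}{|A|} &\ge 2^{-(i-k)-j-O(\log(n+m+l))}
   \quad\text{for the set } A \text{ given by Lemma 4}, \\
S_{i,j} &\le 2^{O(\log(n+m+l))}\,\frac{m(A)}{|A|}
   \le 2^{O(\log(n+m+l))}\, M .
\end{align*}
```

Since $(m+1)(n+1) = 2^{O(\log(n+m))}$, the first and last lines together give $S \le 2^{O(\log(n+m+l))} M$, while $S \ge M$ holds trivially.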


Remark 1. Consider the following case of Corollary 5: $\mathcal{A}$ is the family of all finite subsets, $l = 1$. As was shown in Subsection 2.2, the sum $\sum_{A \ni x} m(A)/|A|$ is equal to $m(x)$ up to a constant factor.

For this reason, we expect that the accuracy in the corollary can be improved.

Proof of (b). Let $y$ be some string that belongs to the probabilistic prediction $d$-neighborhood of $x$. According to (2), this implies that

$$\sum_{A \ni x, y} \frac{m(A)}{|A|} \ge 2^{-d - O(1)}\, m(x).$$

Now we use Corollary 5 for $l = 2$, $x_1 = x$, $x_2 = y$, $m = 4n$ and the family of all sets as $\mathcal{A}$. By this corollary there is a set $A \ni x, y$ such that $m(A)/|A| \ge 2^{-d - O(\log n)}\, m(x)$, so $C(A) + \log|A| - C(x) \le d + O(\log n)$, i.e., $y$ belongs to the algorithmic prediction $d + O(\log n)$-neighborhood of $x$. □
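The passage from the probabilistic inequality to the bound on the optimality deficiency is a matter of taking logarithms. A brief worked version, using $-\log m(u) = C(u) + O(\log C(u))$ together with $C(A) \le 4n$ and $C(x) \le n + O(1)$:

```latex
\begin{align*}
\frac{m(A)}{|A|} \ge 2^{-d-O(\log n)}\, m(x)
 &\iff -\log m(A) + \log|A| \le -\log m(x) + d + O(\log n) \\
 &\implies C(A) + \log|A| \le C(x) + d + O(\log n) \\
 &\iff \delta(x, A) \le d + O(\log n).
\end{align*}
```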

2.4 Sets of restricted type 

In some cases we know a priori what sets could be possible explanations, and are interested only in models from this class. To take this into account, we consider some family $\mathcal{A}$ of finite sets, and look for sets $A$ in $\mathcal{A}$ that contain the data string $x$ and are "good models" for $x$. This approach was used in [6]; it turns out that many results of algorithmic statistics can be extended to this case (though sometimes we get weaker versions with more complicated proofs).


In this section we show that Theorem 3 also has an analog for an arbitrary decidable family $\mathcal{A}$. The family of all subsets of $\mathbb{B}^n$ that belong to $\mathcal{A}$ is denoted by $\mathcal{A}^n$.

First we consider the case when for each string $x$ the family $\mathcal{A}$ contains the singleton $\{x\}$.

Let us define the probabilistic prediction neighborhood for an $n$-bit string $x$ with respect to $\mathcal{A}$. Again we consider a two-stage process: first, some set $A$ of $n$-bit strings from $\mathcal{A}$ is chosen with probability $m(A)$. Second, a random element of $A$ is chosen uniformly. Again, we have to assume that we choose only sets whose complexity is not greater than $4n$. The value $p_{\mathcal{A}}(y \mid x)$ is then defined as the conditional probability of $y \in A$ under the condition "the output of the two-stage process is $x$":

$$p_{\mathcal{A}}(y \mid x) = \frac{\sum_{A \ni x, y} m(A)/|A|}{\sum_{A \ni x} m(A)/|A|}. \tag{5}$$

Here the sums are taken over all sets in $\mathcal{A}^n$ that have complexity at most $4n$.

Again, as in Subsection 2.2, the denominator equals $m(x)$ up to an $O(1)$-factor (because $\{x\} \in \mathcal{A}$), so

$$p_{\mathcal{A}}(y \mid x) = \frac{\sum_{A \ni x, y} m(A)/|A|}{m(x)} \tag{6}$$

up to an $O(1)$-factor.

Then the $\mathcal{A}$-probabilistic prediction $d$-neighborhood is defined naturally: a string $y$ belongs to this neighborhood if $p_{\mathcal{A}}(y \mid x) \ge 2^{-d}$. The $\mathcal{A}$-algorithmic prediction $d$-neighborhood of $x$ is defined as follows: a string $y$ belongs to it if there is a set $A \ni x, y$ that belongs to $\mathcal{A}^n$ such that $\delta(x, A) \le d$.

Now we are ready to state an analog of Theorem 3.


Theorem 6. Let $\mathcal{A}$ be a decidable family of finite sets of binary strings containing all singletons. Then:

(a) For every $n$-bit string $x$ and for every $d$ the $\mathcal{A}$-algorithmic prediction $d$-neighborhood of $x$ is contained in the $\mathcal{A}$-probabilistic prediction $d + O(\log n)$-neighborhood.

(b) For every $n$-bit string $x$ and for every $d$ the $\mathcal{A}$-probabilistic prediction $d$-neighborhood of $x$ is contained in the $\mathcal{A}$-algorithmic prediction $d + O(\log n)$-neighborhood.

Proof of (a). The proof is similar to the proof of Theorem 3 (a). Assume that a string $y$ belongs to the $\mathcal{A}$-algorithmic prediction $d$-neighborhood of $x$, i.e., there is a set $A \in \mathcal{A}^n$ containing $x$ and $y$ such that $C(A) + \log|A| \le C(x) + d$. If $d > 3n$, then the statement is trivial. Indeed, there is a set $A' \in \mathcal{A}^n$ that contains $x$ and $y$ such that $\delta(x, A') \le 3n$. To prove this, we cannot set $A' = \mathbb{B}^n$ any more, as this set may not belong to $\mathcal{A}$. However, we may let $A'$ be the first set in $\mathcal{A}^n$ that contains $x$ and $y$. The complexity of this set is not greater than $|x| + |y| \le 2n$ and its log-size is not greater than $n$. Thus $\delta(x, A') \le 3n$. The rest of the proof is completely similar to the proof of Theorem 3 (a). □

Proof of (b). The proof is similar to the proof of Theorem 3 (b). □

Now we state and prove Theorem 6 in the general case (for families $\mathcal{A}$ that may not contain all singletons). In the case $x \in \bigcup \mathcal{A}^n$, where $n = |x|$, the definition of the $\mathcal{A}$-probabilistic prediction neighborhood remains the same. Otherwise, if $x \notin \bigcup \mathcal{A}^n$, the string $x$ cannot appear in the two-stage process, so in this case we define the $\mathcal{A}$-probabilistic prediction $d$-neighborhood of $x$ as the empty set for every $d$. Notice that now we cannot rewrite (5) as (6) because $\{x\}$ may not belong to $\mathcal{A}$.

Now we define the $\mathcal{A}$-algorithmic prediction neighborhood. There is a subtle point that should be taken into account: it may happen that there is no set $A \in \mathcal{A}$ containing $x$ such that $\delta(x, A) \approx 0$. For this reason we include in the algorithmic prediction neighborhood of $x$ the union of all sets $A$ in $\mathcal{A}$ such that $\delta(x, A)$ is as small as possible:

Definition 3. Let $x \in \mathbb{B}^n$ be a binary string, let $d$ be some integer and let $\mathcal{A}$ be some family of sets. The union of all finite sets $A$ in $\mathcal{A}^n$ such that $x \in A$ and every $B \in \mathcal{A}^n$ that contains $x$ satisfies the inequality $\delta(x, A) \le \delta(x, B) + d$ is called the $\mathcal{A}$-algorithmic prediction $d$-neighborhood of $x$. (In other words, the $d$-neighborhood includes all sets $A$ whose $\delta(x, A)$ is at most $d$ more than the minimum.)
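The relative threshold in Definition 3 can be illustrated on a toy family that does not contain the singleton of the data string. As before, the complexities are made-up stand-ins for the uncomputable values, and the function names are ours.

```python
import math

def deficiency(c_x, c_model, model):
    """delta(x, A) = C(A) + log2|A| - C(x) with assumed complexities."""
    return c_model + math.log2(len(model)) - c_x

def restricted_neighborhood(x, c_x, family, d):
    """Union of sets A in the family containing x whose deficiency is
    within d of the minimum over all sets in the family containing x."""
    candidates = [(a, deficiency(c_x, c, a)) for a, c in family if x in a]
    if not candidates:
        return set()  # x is covered by no set of the family
    best = min(delta for _, delta in candidates)
    result = set()
    for a, delta in candidates:
        if delta <= best + d:
            result |= a
    return result

# Hypothetical family without the singleton {"00"}; complexities are made up.
family = [
    ({"00", "01"}, 2.0),                # delta = 2 + 1 - 1 = 2 (the minimum)
    ({"00", "01", "10", "11"}, 2.0),    # delta = 2 + 2 - 1 = 3
    ({"10", "11"}, 1.0),                # does not contain x
]
hood0 = restricted_neighborhood("00", 1.0, family, d=0)
hood1 = restricted_neighborhood("00", 1.0, family, d=1)
empty = restricted_neighborhood("00", 1.0, [({"10", "11"}, 1.0)], d=5)
```

Even though no set in this toy family has deficiency 0, the 0-neighborhood is non-empty because the threshold is measured from the minimum deficiency that the family can achieve.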

Theorem 7. Let $\mathcal{A}$ be a decidable family of finite sets of binary strings. Then:

(a) For every $n$-bit string $x$ and for every $d$ the $\mathcal{A}$-algorithmic prediction $d$-neighborhood of $x$ is contained in the $\mathcal{A}$-probabilistic prediction $d + O(\log n)$-neighborhood.

(b) For every $n$-bit string $x$ and for every $d$ the $\mathcal{A}$-probabilistic prediction $d$-neighborhood of $x$ is contained in the $\mathcal{A}$-algorithmic prediction $d + O(\log n)$-neighborhood.

Notice that if $x \notin \bigcup \mathcal{A}^n$ then both the algorithmic and the probabilistic prediction neighborhoods are empty and the statement is trivial. Therefore in the proof we will assume that this is not the case.

Proof of (a). The proof is completely similar to the proof of Theorem 6 (a). □


Proof of (b). Let $y$ be some string that belongs to the $\mathcal{A}$-probabilistic prediction $d$-neighborhood of $x$, that is,

$$\sum_{A \ni x, y} \frac{m(A)}{|A|} \ge 2^{-d} \sum_{A \ni x} \frac{m(A)}{|A|}. \tag{7}$$
Let

$$A_x = \arg\max\{\, m(A)/|A| \;\mid\; x \in A \in \mathcal{A}^n \,\}$$

and

$$A_{x,y} = \arg\max\{\, m(A)/|A| \;\mid\; x, y \in A \in \mathcal{A}^n \,\}.$$

Recall that $m(A)/|A| = 2^{-C(A) - \log|A|}$ (up to a $2^{O(\log n)}$ factor), and by Corollary 5 the sums in both parts of the inequality (7) are equal to their largest terms (again up to a $2^{O(\log n)}$ factor). Therefore,

$$2^{-C(A_{x,y}) - \log|A_{x,y}|} \ge 2^{-d - O(\log n)}\, 2^{-C(A_x) - \log|A_x|},$$

which means that $\delta(x, A_{x,y}) \le \delta(x, A_x) + d + O(\log n)$. Hence $y$ belongs to the $\mathcal{A}$-algorithmic prediction $d + O(\log n)$-neighborhood of $x$. □

2.5 Prediction for several examples 

Consider the following situation: we have not one but several strings $x_1, \ldots, x_l \in \mathbb{B}^n$ that are experimental data. We know that they were drawn independently with respect to the uniform probability distribution on some unknown set $A$. We want to explain these observation data, i.e., to find an appropriate set $A$. Again we measure the quality of explanations by two parameters: $C(A)$ and $\log|A|$.

In this section we will extend the previous results to this scenario. Again we assume that we know a priori which sets could be possible explanations. So we consider only sets from a decidable family of sets $\mathcal{A}$.

Let $\bar{x}$ denote the tuple $x_1, \ldots, x_l$. Let $A \subset \mathbb{B}^n$ be a set that contains all strings from $\bar{x}$. Then we can restore $\bar{x}$ from $A$ and the indexes of the strings from $\bar{x}$ in $A$, and hence we have:

$$C(\bar{x}) \le C(A) + l \log|A| + O(\log n).$$

Therefore it is natural to define the optimality deficiency of $A \ni \bar{x}$ by the formula

$$\delta(\bar{x}, A) := C(A) + l \log|A| - C(\bar{x}).$$

The definition of the $\mathcal{A}$-algorithmic prediction $d$-neighborhood of the tuple $\bar{x}$ is obtained from Definition 3 by changing $x$ to $\bar{x}$.

In a similar way we modify the definition of the $\mathcal{A}$-probabilistic prediction neighborhood. Again we consider a two-stage process: first, a set $A$ of $n$-bit strings from $\mathcal{A}$ is chosen with probability $m(A)$. Second, $l$ random elements of $A$ are chosen uniformly and independently of each other. Again, for a technical reason, we consider only sets whose complexity is not greater than $(l+3)n$. The value $p_{\mathcal{A}}(y \mid \bar{x})$ is defined as the conditional probability of $y \in A$ under the condition "the output of this two-stage process is equal to $\bar{x}$":

$$p_{\mathcal{A}}(y \mid \bar{x}) = \frac{\sum_{A \ni \bar{x}, y} m(A)/|A|^l}{\sum_{A \ni \bar{x}} m(A)/|A|^l}.$$

Here both sums are taken over all sets $A \in \mathcal{A}^n$ that have complexity at most $n(l+3)$. (If no such set contains $\bar{x}$, then $p_{\mathcal{A}}(y \mid \bar{x}) = 0$.) By definition, a string $y$ belongs to the $\mathcal{A}$-probabilistic prediction $d$-neighborhood of $\bar{x}$ if $p_{\mathcal{A}}(y \mid \bar{x}) \ge 2^{-d}$.
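The effect of several examples can be seen on a toy computation of the two-stage process with $l$ independent draws, where a set $A$ produces a given tuple with probability $m(A)/|A|^l$. The prior weights below are hypothetical stand-ins for $m(A)$, and the function name is ours.

```python
from fractions import Fraction

def tuple_conditional_probability(y, xs, prior):
    """p(y | x_1..x_l): pick A with weight prior[A], then draw l elements
    of A uniformly and independently; condition on the draws being xs."""
    l = len(xs)
    covering = [(a, w) for a, w in prior.items() if all(x in a for x in xs)]
    marginal = sum(w / len(a) ** l for a, w in covering)
    if marginal == 0:
        return Fraction(0)  # no admissible set contains the tuple
    joint = sum(w / len(a) ** l for a, w in covering if y in a)
    return joint / marginal

# Hypothetical prior over two models of 2-bit strings.
prior = {
    frozenset({"00", "11"}): Fraction(1, 4),
    frozenset({"00", "01", "10", "11"}): Fraction(1, 4),
}
p_one = tuple_conditional_probability("01", ["00"], prior)
p_two = tuple_conditional_probability("01", ["00", "11"], prior)
p_none = tuple_conditional_probability("00", ["zz"], prior)
```

In this toy setting a second example that fits the smaller set lowers the predicted probability of 01 from 1/3 to 1/5, because larger sets are penalized by the factor $|A|^l$.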

Now we are ready to state an analog of Theorem 7.


Theorem 8. Let $\mathcal{A}$ be a decidable family of finite sets of binary strings. Then:

(a) For every tuple $\bar{x}$ of $l$ $n$-bit strings and for every $d$ the $\mathcal{A}$-algorithmic prediction $d$-neighborhood of $\bar{x}$ is contained in the $\mathcal{A}$-probabilistic prediction $d + O(\log(n+l))$-neighborhood of $\bar{x}$.

(b) For every tuple $\bar{x}$ of $l$ $n$-bit strings and for every $d$ the $\mathcal{A}$-probabilistic prediction $d$-neighborhood of $\bar{x}$ is contained in the $\mathcal{A}$-algorithmic prediction $d + O(\log(n+l))$-neighborhood of $\bar{x}$.

Proof. The proof is entirely similar to the proof of Theorem 7, but now Corollary 5 is applied for $l$ and $l+1$ strings, so the accuracy becomes $O(\log(n+l))$. □


3 Non-uniform probability distributions 

We have considered so far only uniform probability distributions as statistical hypotheses. The paper [7, Appendix II] justifies such a restriction: it was observed there that for every data string $x$ and for every probability distribution $P$ there is a finite set $A \ni x$ that is not worse than $P$ as an explanation for $x$ (with logarithmic accuracy). However, if the data consists of more than one string, then this is not the case. We now explain this in more detail.

The quality of a probability distribution $P$ as an explanation for the data $x_1, \ldots, x_l$ is measured by the following two parameters:

• the complexity $C(P)$ of the distribution $P$,
• $-\log(P(x_1) \cdots P(x_l))$ (the smaller this parameter is, the larger is the likelihood to get the tuple $\bar{x}$ by independently drawing $l$ strings with respect to $P$).

We consider only distributions over finite sets such that the probability of every outcome is a rational number. The complexity of such a distribution is defined as the complexity of the set of all pairs $(y, P(y))$ ordered lexicographically.

If $P$ is the uniform distribution over a finite set $A$, then the first parameter becomes $C(A)$ and the second one becomes $l \log|A|$. If $l = 1$, then for every pair $x, P$ there is a finite set $A \ni x$ such that $C(A)$ and $\log|A|$ are at most $C(P)$ and $-\log P(x)$, respectively, with accuracy $O(\log|x|)$. Indeed, for $x$ of length $n$ let $A = \mathbb{B}^n$ if $P(x) \le 2^{-n}$ and

$$A = \{\, x' \in \mathbb{B}^n \mid P(x') \ge 2^{-i} \,\}$$

if $2^{-i} \le P(x) < 2^{-i+1}$ for some $i \le n$. In both cases we have $C(A) \le C(P) + O(\log n)$ and $\log|A| \le -\log P(x) + 1$.
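The construction of the set $A$ from a distribution $P$ is easy to carry out explicitly. The sketch below follows the two cases of the argument; the distribution is a hypothetical example on 3-bit strings, and the function name is ours.

```python
import math
from fractions import Fraction

def set_from_distribution(P, x, n):
    """Return a set A containing x with log2|A| <= -log2 P(x) + 1.

    P maps n-bit strings to rational probabilities; strings that are
    not keys of P are treated as having probability zero.
    """
    px = P[x]
    if px <= Fraction(1, 2 ** n):
        # Low-probability case: take all n-bit strings.
        return {format(v, f"0{n}b") for v in range(2 ** n)}
    # Find the least i with 2^-i <= P(x), so that P(x) < 2^-(i-1).
    i = 0
    while Fraction(1, 2 ** i) > px:
        i += 1
    # At most 2^i strings can have probability >= 2^-i.
    return {s for s, p in P.items() if p >= Fraction(1, 2 ** i)}

# Hypothetical distribution on 3-bit strings (probabilities sum to 1).
P = {"000": Fraction(1, 2), "001": Fraction(1, 4),
     "010": Fraction(3, 16), "011": Fraction(1, 16)}
A = set_from_distribution(P, "010", n=3)
A_low = set_from_distribution(P, "011", n=3)
```

For the string 010 the threshold is $2^{-3}$ and the resulting set has three elements; for the low-probability string 011 the sketch falls back to all eight 3-bit strings.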

For $l = 2$ this is not the case:

Example 9. Let $x_1$ be a random string of length $2n$ and let $x_2 = 00\ldots0y$ be a string of length $2n$ where $y$ is a random string of length $n$ independent of $x_1$ (that is, $C(x_1, x_2) = 3n + O(1)$). A plausible explanation of such data is the following: the strings $x_1, x_2$ were drawn independently with respect to the distribution $P$ where half of the probability is uniformly distributed over all strings of length $2n$ and the remaining half is uniformly distributed over all strings of length $2n$ starting with $n$ zeros. The complexity of this distribution $P$ is negligible ($O(\log n)$) and the second parameter $-\log(P(x_1)P(x_2))$ is about $3n$. On the other hand, there is no simple set $A$ containing both strings $x_1, x_2$ with $2\log|A|$ being close to $3n$. Indeed, for every set $A$ containing $x_1$ we have $C(A) + \log|A| \ge C(x_1) - O(\log n) = 2n - O(\log n)$ and hence $2\log|A| \ge 4n - 2C(A) - O(\log n) \gg 3n$ (the last inequality holds provided $C(A)$ is small).
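The numbers in Example 9 can be reproduced for a concrete $n$. In the sketch below the two strings are fixed stand-ins for "random" strings (randomness in the sense of Kolmogorov complexity cannot be certified by a program), and the function name is ours.

```python
import math
from fractions import Fraction

def mixture_probability(s: str, n: int) -> Fraction:
    """P(s) for the mixture of Example 9 on strings of length 2n:
    half of the mass uniform on all strings of length 2n, half uniform
    on the strings that start with n zeros."""
    assert len(s) == 2 * n
    p = Fraction(1, 2) * Fraction(1, 2 ** (2 * n))
    if s.startswith("0" * n):
        p += Fraction(1, 2) * Fraction(1, 2 ** n)
    return p

n = 8
x1 = "1011001110001011"    # stand-in for a random string of length 2n
x2 = "0" * n + "10110100"  # n zeros followed by a stand-in random y
product = mixture_probability(x1, n) * mixture_probability(x2, n)
neg_log = -math.log2(float(product))

# Cost 2*log2|A| of the simplest uniform model, A = all strings of length 2n.
uniform_cost = 2 * (2 * n)
```

For $n = 8$ the mixture costs about $3n + 2 = 26$ bits, while the set of all strings of length $2n$ costs $4n = 32$ bits.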

Therefore we will not restrict the class of statistical hypotheses to uniform distributions. We will show that the main result of [7] translates to the case of several strings, i.e., to the case $l > 1$ (see Subsection 3.2 below).
3.1 The profile of a tuple $x_1, \ldots, x_l$

Fix $x_1, \ldots, x_l \in \mathbb{B}^n$. As above, we denote by $\bar{x}$ the tuple $x_1, \ldots, x_l$. The optimality deficiency is defined by the formula

$$\delta(\bar{x}, P) = C(P) - \log(P(x_1) \cdots P(x_l)) - C(\bar{x}).$$

This value is non-negative up to $O(\log(n+l))$, since given $P$ and $l$ we can describe the tuple $\bar{x}$ in $-\log(P(x_1) \cdots P(x_l)) + O(1)$ bits, using the Shannon-Fano code.
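The code length used in this argument can be computed exactly for rational probabilities. The sketch below assigns a tuple the length $\lceil -\log_2 (P(x_1) \cdots P(x_l)) \rceil$, i.e., it treats the tuple as a single outcome of the product distribution; the example distribution and the function names are hypothetical.

```python
from fractions import Fraction

def code_length(p: Fraction) -> int:
    """Shannon-Fano codeword length ceil(-log2 p) for probability p > 0."""
    length = 0
    while Fraction(1, 2 ** length) > p:
        length += 1
    return length

def tuple_code_length(P, xs) -> int:
    """Codeword length of the tuple xs under the product distribution."""
    product = Fraction(1)
    for x in xs:
        product *= P[x]
    return code_length(product)

# Hypothetical distribution on 2-bit strings.
P = {"00": Fraction(1, 2), "01": Fraction(1, 4),
     "10": Fraction(1, 8), "11": Fraction(1, 8)}
# Kraft sum of the single-string codeword lengths; a prefix code exists
# whenever this sum is at most 1.
kraft_sum = sum(Fraction(1, 2 ** code_length(p)) for p in P.values())
bits = tuple_code_length(P, ["00", "10", "01"])
```

For this dyadic example the tuple (00, 10, 01) has probability $2^{-6}$ and receives a codeword of exactly 6 bits.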

Definition 4. The profile $P_{\bar{x}}$ of the tuple $\bar{x}$ is defined as the set of all pairs $(a, b)$ of naturals such that there is a probability distribution $P$ of Kolmogorov complexity at most $a$ with $\delta(\bar{x}, P) \le b$.

Loosely speaking, a tuple of strings $\bar{x}$ is called stochastic if there is a simple distribution $P$ such that $\delta(\bar{x}, P) \approx 0$; in other words, if $(a, b) \in P_{\bar{x}}$ for $a, b \approx 0$. Otherwise it is called non-stochastic. In the one-dimensional case non-stochastic objects were studied, for example, in [10], [7]. However, in the one-dimensional case we cannot present a non-stochastic object explicitly. In the two-dimensional case the situation is quite different: let $x_1$ be a random string of length $n$ and let $x_2 = x_1$. For such a pair $x_1, x_2$ there is no simple distribution $P$ with small $\delta((x_1, x_2), P)$. Indeed, for any probability distribution $P$ we have $C(P) - \log P(x_j) \ge C(x_j) = n$ for $j = 1, 2$ (with accuracy $O(\log n)$). Adding these inequalities we get

$$2C(P) - \log(P(x_1)P(x_2)) \ge 2n.$$

Hence $\delta((x_1, x_2), P) \ge 2n - C(P) - C(x_1, x_2) = n - C(P)$, which is very large provided $C(P) \ll n$.

In general, if strings $x_1$ and $x_2$ have much common information (i.e.,
$C(x_1, x_2) \ll C(x_1) + C(x_2)$), then the pair $(x_1, x_2)$ is non-stochastic. There
is also a non-explicit example of a non-stochastic pair of strings: consider
any pair whose first term is non-stochastic. There is no good explanation for
the first term, hence there is no good explanation for the whole pair.

The first example suggests the following question: is the profile of the pair
of strings $x_1, x_2$ determined by $C(x_1)$, $C(x_2)$, $C(x_1, x_2)$, $P_{x_1}$, $P_{x_2}$, $P_{[x_1, x_2]}$?
Here $[x_1, x_2]$ denotes the concatenation of strings $x_1$ and $x_2$. Notice that
$P_{[x_1, x_2]}$ denotes the 1-dimensional profile of the string $[x_1, x_2]$ and is not to be
confused with $P_{x_1, x_2}$, which is the 2-dimensional profile of the pair of strings
$x_1, x_2$. The following theorem is the main result of Section 3. It provides a
negative answer to this question.
Theorem 10. For every n there are strings $x_1$, $x_2$, $y_1$ and $y_2$ of length 2n
such that:

1) The sets $P_{x_1}$ and $P_{y_1}$, $P_{x_2}$ and $P_{y_2}$, $P_{[x_1, x_2]}$ and $P_{[y_1, y_2]}$ are at most
$O(\log n)$ apart.

2) $C(x_1) = C(y_1) + O(\log n)$, $C(x_2) = C(y_2) + O(\log n)$, $C(x_1, x_2) =
C(y_1, y_2) + O(\log n)$.

3) However, the distance between $P_{x_1, x_2}$ and $P_{y_1, y_2}$ is greater than
$0.5n - O(\log n)$. (We say that the distance between two sets R and Q is at most
$\varepsilon$ if R is contained in the $\varepsilon$-neighborhood, with respect to the
$L_\infty$-norm, of Q, and vice versa.)

The proof of this theorem is presented in Appendix. 

3.2 Randomness deficiency 

In this subsection we introduce multi-dimensional randomness deficiency and
show that the main result of [7] relating 1-dimensional randomness deficiency
and optimality deficiency translates to any number of strings.

The 1-dimensional randomness deficiency of a string x in a finite set A
was defined by Kolmogorov as $d(x|A) = \log|A| - C(x|A)$. It is always
non-negative (with $O(\log|x|)$ accuracy), as we can find x from A and the index
of x in A. For most elements x in any set A the randomness deficiency of
x in A is negligible. More specifically, the fraction of x in A with randomness
deficiency greater than $\beta$ is less than $2^{-\beta}$. The randomness deficiency
measures how non-typical x looks in A.
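
The bound $2^{-\beta}$ is a counting argument: a string with deficiency greater than $\beta$ has a conditional description shorter than $\log|A| - \beta$ bits, and there are fewer than $2^{\log|A| - \beta}$ such descriptions. The sketch below counts them by brute force in the simplified case where $|A| = 2^m$ and $\beta$ is an integer, ignoring the logarithmic accuracy terms; the function names are hypothetical.

```python
from itertools import product

def descriptions_shorter_than(k: int) -> int:
    """Count binary strings of length < k (candidate descriptions) by enumeration."""
    return sum(1 for length in range(k) for _ in product("01", repeat=length))

def deficiency_fraction_bound(m: int, beta: int) -> float:
    """Upper bound on the fraction of x in A (|A| = 2**m) with d(x|A) > beta.

    d(x|A) > beta means C(x|A) < m - beta, so x has a description of
    length < m - beta, and distinct x need distinct descriptions.
    """
    return descriptions_shorter_than(m - beta) / 2 ** m

print(deficiency_fraction_bound(m=10, beta=3))  # strictly below 2**-3
```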

Definition 5. The set of all pairs (a, b) such that there is a set $A \ni x$ of
complexity at most a and $d(x|A) \le b$ is called the stochasticity profile of x
and is denoted by $Q_x$.

To distinguish the profiles $P_x$ and $Q_x$ we will call $P_x$ the optimality profile in
the sequel. Surprisingly, the sets $P_x$ and $Q_x$ almost coincide:

Theorem 11 ([7]). For every string x of length n the distance between $P_x$
and $Q_x$ is at most $O(\log n)$.

The multi-dimensional randomness deficiency is defined in the following
way. For a tuple of strings $\vec{x} = x_1, \dots, x_l$ and a distribution P let

\[ d(\vec{x}|P) = -\log(P(x_1) \cdots P(x_l)) - C(x_1, \dots, x_l|P). \]

If l = 1 and P is a uniform distribution on a finite set, then this definition is
equivalent to the one-dimensional case. The randomness deficiency
measures how implausible it is to get $x_1, \dots, x_l$ as a result of l independent draws
from P. The set of all pairs (a, b) such that there is a distribution P of
complexity at most a and $d(\vec{x}|P) \le b$ is called the l-dimensional stochasticity
profile of $\vec{x}$ and is denoted by $Q_{\vec{x}}$.

It turns out that Theorem 11 translates to the multi-dimensional case:

Theorem 12. For every tuple $\vec{x} = x_1, \dots, x_l$ of strings of length n the
distance between the sets $P_{\vec{x}}$ and $Q_{\vec{x}}$ is at most $O(\log(n + l))$.

The proof of this theorem is presented in Appendix. 

Remark 2. Theorem 12 is basically an analog of Theorem 11 for a restricted
class of distributions, namely, for product distributions Q on l-tuples, i.e.,
distributions of the form $Q(x_1, \dots, x_l) = P(x_1) \cdots P(x_l)$. A natural question
is whether Theorem 11 can be generalized to any decidable class of
distributions. This is indeed the case and the proof is very similar to the proof of
Theorem 12 (presented in the Appendix).


An open question 

Can we improve the accuracy in Corollary [5] from to 
4 Appendix

Proof of Theorem 10. Our example is borrowed from [1], where there are
several examples of pairs of strings with non-extractable common information.
All of the examples except one are stochastic pairs of strings, and any such
stochastic example suits our purposes.

Consider a finite field F of cardinality $2^n$ and a plane (two-dimensional
vector space) over F. Let $y_1$ be a random line on this plane, and $y_2$ be a
random point on this line. Then

\[ C(y_1) = 2n, \quad C(y_2) = 2n, \quad C(y_1, y_2) = 3n \]

(everything with logarithmic accuracy). These strings $y_1, y_2$ have about n
bits of common information. On the other hand, [1, Theorem 8] states the
following:

Theorem 13 ([1]). There is no z such that $C(z) = n + O(\log n)$, $C(y_1|z) =
n + O(\log n)$, $C(y_2|z) = n + O(\log n)$ (such a string z could be considered
as a representation of the common information in $y_1, y_2$). Moreover, for all
strings z we have

\[ C(z) + C(y_1|z)/2 + \max\{C(y_1|z)/2,\ C(y_2|z)\} \ge 3n - O(\log n), \qquad (8) \]
\[ C(z) + C(y_2|z)/2 + \max\{C(y_2|z)/2,\ C(y_1|z)\} \ge 3n - O(\log n). \qquad (9) \]

Let us first show that inequalities (8) and (9) imply that

\[ C(z) + C(y_1|z) + C(y_2|z) \ge \min\{4n - C(z)/3,\ 5n - C(z)\} - O(\log n). \qquad (10) \]

Indeed, if $C(y_1|z)$ and $C(y_2|z)$ differ by a factor of at most 2, then
the maximum in both inequalities (8) and (9) is equal to the second term,
and summing (8) and (9) we get

\[ 2C(z) + 3C(y_1|z)/2 + 3C(y_2|z)/2 \ge 6n - O(\log n), \]

which can be re-written as

\[ C(z) + C(y_1|z) + C(y_2|z) \ge 4n - C(z)/3 - O(\log n). \]

Otherwise, when, say, $C(y_1|z) > 2C(y_2|z)$, the maximum in inequality (8) is
equal to the first term. Then we sum that inequality with the inequality
$C(z) + C(y_2|z) \ge C(y_2) = 2n$ and obtain the inequality

\[ 2C(z) + C(y_1|z) + C(y_2|z) \ge 5n - O(\log n), \]

which can be re-written as

\[ C(z) + C(y_1|z) + C(y_2|z) \ge 5n - C(z) - O(\log n). \]

Thus in both cases we obtain (10).
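
The case analysis can be sanity-checked by brute force. The sketch below drops all logarithmic terms and treats C(z), C(y_1|z), C(y_2|z) as integers c, u, v; it also assumes the bound $C(z) + C(y_i|z) \ge C(y_i) = 2n$ for both i = 1, 2 (the text states it for i = 2; the case i = 1 is needed for the symmetric situation). The function name is hypothetical.

```python
def implies_inequality_10(n: int) -> bool:
    """Exhaustively test integer triples (c, u, v) standing for C(z), C(y1|z), C(y2|z)."""
    for c in range(4 * n + 1):
        for u in range(4 * n + 1):
            for v in range(4 * n + 1):
                # Inequalities (8) and (9), multiplied by 2 to stay in integers.
                ineq8 = 2 * c + u + max(u, 2 * v) >= 6 * n
                ineq9 = 2 * c + v + max(v, 2 * u) >= 6 * n
                # Assumed: C(z) + C(y_i|z) >= C(y_i) = 2n for i = 1, 2.
                recover = c + u >= 2 * n and c + v >= 2 * n
                if not (ineq8 and ineq9 and recover):
                    continue
                # Inequality (10), multiplied by 3:
                # c + u + v >= min(4n - c/3, 5n - c).
                if 3 * (c + u + v) < min(12 * n - c, 15 * n - 3 * c):
                    return False
    return True

print(implies_inequality_10(6))  # True: no counterexample among the tested triples
```

Working with the inequalities scaled to integers avoids any floating-point tolerance in the comparison.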

This implies that the optimality profile $P_{y_1, y_2}$ of the pair of strings existing
by Theorem 13 has the following property:

\[ (a, b) \in P_{y_1, y_2} \ \Rightarrow\ b \ge \min\{n - a/3,\ 2n - a\} - O(\log n). \qquad (11) \]

Indeed, for every probability distribution P we have $C(y|P) \le -\log P(y) +
O(1)$ and hence

\[ \delta((y_1, y_2), P) \ge C(P) + C(y_1|P) + C(y_2|P) - 3n - O(1). \qquad (12) \]

Combining inequality (10) for z = P and inequality (12) we obtain (11).

Thus the optimality profile of the pair $y_1, y_2$ does not contain the pair
$(1.5n,\ 0.5n - O(\log n))$. On the other hand, all the strings $y_1$, $y_2$, $[y_1, y_2]$ are
stochastic, that is, the sets $P_{y_1}$, $P_{y_2}$, $P_{[y_1, y_2]}$ contain almost all pairs (a, b)
(more specifically, all pairs with $a, b \ge O(\log n)$).

It is easy to construct another pair of strings $x_1, x_2$ that has the same
properties except that the pair $(n + O(1), O(1))$ is inside $P_{x_1, x_2}$. To this end let
$x_1, x_2$ be random strings of length 2n that share the first n bits: $x_1 = x^* x_1'$,
$x_2 = x^* x_2'$ and $C(x^* x_1' x_2') = 3n + O(1)$. Then again $C(x_1) = 2n + O(1)$,
$C(x_2) = 2n + O(1)$, $C(x_1, x_2) = 3n + O(1)$. And again all the strings $x_1$, $x_2$,
$[x_1, x_2]$ are stochastic. To show that the pair $(n + O(\log n), O(\log n))$ is inside
$P_{x_1, x_2}$, consider the uniform distribution P on all strings of length 2n whose first
half is equal to $x^*$. This distribution has the same complexity as $x^*$, that is,
$C(P) = n + O(1)$, and each $x_i$ has probability $2^{-n}$, hence
$C(P) - \log P(x_1) - \log P(x_2) = 3n + O(1) = C(x_1, x_2)$. Hence even the pair
$(n + O(1), O(1))$ belongs to $P_{x_1, x_2}$. □

Proof of Theorem 12. The proof is similar to the proof of Theorem 11. First
notice that for every distribution P we have $d(\vec{x}|P) \le \delta(\vec{x}, P) + O(\log(n + l))$.
Indeed:

\[ d(\vec{x}|P) = -\log(P(x_1) \cdots P(x_l)) - C(\vec{x}|P) \]
\[ \le -\log(P(x_1) \cdots P(x_l)) + C(P) - C(\vec{x}) = \delta(\vec{x}, P). \]

Therefore the set $Q_{\vec{x}}$ includes the set $P_{\vec{x}}$ (with accuracy $O(\log(n + l))$).

It remains to show the inverse inclusion. From the above inequalities it
is clear that the difference between $\delta(\vec{x}, P)$ and $d(\vec{x}|P)$ equals

\[ (C(P) - C(\vec{x})) + C(\vec{x}|P) = C(P|\vec{x}), \]

where the equality follows from the Symmetry of information (see, e.g. 0)- 
It turns out that if $C(P|\vec{x})$ is large then there is an explanation $P'$ for $\vec{x}$
with much better parameters:

Lemma 14. For every distribution P and for every tuple $\vec{x} = x_1, \dots, x_l$ of
strings of length n there is a distribution $P'$ such that:

1) $-\log(P'(x_1) \cdots P'(x_l)) \le -\log(P(x_1) \cdots P(x_l)) + O(\log(n + l))$ and

2) $C(P') \le C(P) - C(P|\vec{x}) + O(\log(n + l))$.

To prove this lemma we need yet another one:

Lemma 15. Let $x_1, \dots, x_l \in \mathbb{B}^n$. Assume that there are $2^k$ distributions P
such that:

1) $-\log(P(x_1) \cdots P(x_l)) \le b$.

2) $C(P) \le a$.

Then there is a distribution $P'$ of complexity at most $a - k + O(\log(n +
l + a + b))$ such that $-\log(P'(x_1) \cdots P'(x_l)) \le b$.

Proof of Lemma 15. In Lemma H] let L be the set of probability
distributions and R be the set of l-tuples of n-bit strings. Then let $(x_1, \dots, x_l)$ be
adjacent to Q if $\log(Q(x_1) \cdots Q(x_l)) \ge -b$. □

Proof of Lemma 14. Assume that a tuple $\vec{x}$ is given. Enumerate all
distributions Q such that $C(Q) \le a = C(P)$ and $-\log(Q(x_1) \cdots Q(x_l)) \le b =
-\log(P(x_1) \cdots P(x_l))$. We can retrieve P from $\vec{x}$ and the ordinal number of
P in this enumeration. Thus the logarithm of that number must be at least
$C(P|\vec{x})$ (with logarithmic accuracy), so the enumeration contains at least
$2^{C(P|\vec{x})}$ distributions. By Lemma 15 for $k = C(P|\vec{x})$
there is a probability distribution $P'$ in the enumeration whose complexity is
at most a - k (with logarithmic accuracy). □

Now we are ready to finish the proof of the theorem. Consider some distribution
P. We need to show that there is a distribution $P'$ such that $C(P') \le
C(P) + O(\log(n + l))$ and $\delta(\vec{x}, P') \le d(\vec{x}|P) + O(\log(n + l))$. To this end
consider the distribution $P'$ that Lemma 14 provides for P. By construction the
complexity of $P'$ is at most that of P (with logarithmic accuracy). And its
optimality deficiency can be bounded as follows:

\[ \delta(\vec{x}, P') = C(P') - \log(P'(x_1) \cdots P'(x_l)) - C(x_1, \dots, x_l) \]
\[ \le C(P) - C(P|\vec{x}) - \log(P(x_1) \cdots P(x_l)) - C(x_1, \dots, x_l) \]
\[ = \delta(\vec{x}, P) - C(P|\vec{x}) = d(\vec{x}|P). \] □